ABSTRACT

This paper describes a real-world deployment of a context-aware
mobile app recommender system (RS) called Frappe. Using a hybrid
approach, we conducted a large-scale app market deployment with
1000 Android users combined with a small-scale local user study
involving 33 users. The resulting usage logs and subjective
feedback enabled us to gather key insights into (1)
context-dependent app usage and (2) the perceptions and
experiences of end-users while interacting with context-aware
mobile app recommendations. While Frappe performs very well based
on usage-centric evaluation metrics, insights from the small-scale
study reveal some negative user experiences. Our results point to
a number of actionable lessons learned, specifically related to
designing, deploying and evaluating mobile context-aware RS
in-the-wild with real users.
1. INTRODUCTION

The mobile world continues to grow at a phenomenal rate. Once
used as mere communications devices, mobile phones have evolved
into sophisticated personal computers enabling access to a wealth
of information and services, anytime and anywhere. These
smartphones now include sensors that enable the collection and
analysis of contextual information (e.g. GPS, accelerometer) from
which we can draw insights about the intentions, activities and
locations of mobile users. The world of mobile apps has also
grown exponentially. Both Apple and Google have recently reported
reaching one million apps in their app stores. Users' demand for
mobile apps is also steadily increasing, with downloads from
mobile app
stores expected to reach 98 billion by 20110 

This increasing volume of available mobile apps has resulted in
significant app overload and app discovery challenges for mobile
users. As a result, several app recommendation and aggregation
services have emerged. Some of these services support the
discovery of relevant apps through either ratings or
recommendations based on user profiles, types of installed apps,
and in some cases the usage of those apps, e.g. Appolicious,
AppsFire, Zwapp, Appsaurus, and AppAware [10]. Recent work has
leveraged unique mobile contexts, such as location, to provide
context-aware recommendations or CARS [15]. These mobile contexts
have been shown to have a significant impact on the needs,
behaviors and app usage patterns of mobile users. Hence, it seems
likely that utilizing such contextual data in mobile app
recommendation would lead to enhanced end-user experiences.

Context-aware recommendation algorithms have been shown to
outperform other state-of-the-art recommendation approaches.
However, the vast majority of these evaluations were conducted
off-line, with a core focus on performance and effectiveness from
an algorithmic perspective. To date, little is understood about
(1) how useful these context-aware algorithms are in a
real-world, in-situ scenario, and (2) the subjective perceptions
and experiences of end-users with the recommendations provided.

In this paper we describe a real-world deployment of a mobile app
recommender called Frappe, which provides context-aware mobile
app recommendations by means of a Tensor Factorization approach
[0]. Frappe's recommendations leverage both implicit usage data
and contextual factors to provide suggestions of relevant apps to
each user. The core contributions of this work can be summarized
as follows:


• An anonymised context-aware app usage dataset, which we have
released at http://baltrunas.info/research-menu/frappe;


Footnote 1: See http://bit.ly/lcKxavd
Footnote 2: See http://bit.ly/ljlVJQt
Footnote 3: See http://bit.ly/13UgmLZ
Figure 1: Screenshots of Frappe.


• A characterization of context dependent app usage, 
based on the data gathered via our large-scale app 
market deployment of Frappe in Google Play with 1000 
Android users; 

• Key results from our hybrid in-situ study of Frappe 
highlighting key insights into user experiences; 

• A set of actionable lessons learned related to how to 
effectively design, deploy and evaluate context-aware 
mobile RS in-the-wild with real users. 

2. RELATED WORK 

Recommenders for Mobile Apps: Earlier work on mobile
context-aware recommender systems (CARS) has shown that
contextual factors, e.g. time, location, activity, weather,
emotional state and the user's social network, heavily influence
the recommendation needs of people. Most of this research has
focused on the effectiveness of the recommender from an
algorithmic and performance perspective, highlighting positive
off-line evaluation results, e.g. [15].

Given the rise in popularity of mobile app markets like Google
Play and Apple's App Store, coupled with the increasing volume of
available mobile apps, researchers in the CARS space have begun
to focus explicitly on the mobile app recommendation domain. For
example, Woerndl et al. [26] describe a hybrid RS that can
recommend mobile apps to users based on what other users have
installed in a similar context, in this case location. Davison &
Moritz present Applause, which provides context-aware
recommendations utilizing location as the key form of context and
provides mechanisms for solving the new user problem. In [14],
Jannach & Hegelich focus on the recommendation of game
applications within a mobile Internet portal and show that game
buying behaviors increase among users who receive personalized
recommendations vs. non-personalized recommendations. AppJoy [27]
supports personalized mobile app recommendations by combining
item-based collaborative filtering with data on how the user
actually uses his/her installed apps. Offline evaluation results
using a dataset of 4600 Android users showed that users
interacted longer with the recommended apps. AppAware recommends
new mobile applications by making use of context in the form of
location. It provides the user with real-time information on
application installs, uninstalls and updates so that the user is
made more aware of what applications other users are interacting
with in his/her proximity. Appazaar is a prototype recommender
system for mobile apps which utilizes the user's current and
historical location information as well as app usage to recommend
apps. Data gathered via a Google Play deployment of Appazaar was
used to provide in-depth insights about mobile app usage by
smartphone users. Most recently, Böhmer et al. propose a
usage-centric framework to evaluate app recommenders by utilizing
key events within a mobile app's life-cycle (e.g. recommendation,
install and long-term usage).

User Studies of Recommender Systems (RS): In recent years, the
community has begun to investigate RS effectiveness from a more
user-centric perspective. McNee et al. highlight that user
satisfaction does not always correlate with high recommender
accuracy and argue that RS evaluations should move beyond
traditional accuracy metrics and look towards more user-centric
factors. Pu et al. [22] outline a comprehensive framework to
evaluate the perceived qualities of recommender systems in a
model called ResQue (Recommender systems' Quality of user
experience). Knijnenburg et al. describe a framework for
understanding the user experience of RS, describing why and how
certain aspects of a system lead to better user experiences while
others do not. Cremonesi et al. report on an empirical study
involving 210 users which explored the users' perceived quality
of seven different RSs on the same dataset in an offline
evaluation.

While user-centric evaluations of RS have been conducted, the
majority of these studies have taken place in lab-based settings,
with few participants and no concrete mobile focus. To date,
real-world deployments of mobile RS in-the-wild have been rather
limited, with very few subjective insights from users. The goal
of this paper is to help bridge this gap by combining a
large-scale deployment of a mobile app recommender with a
smaller-scale in-situ field study to gather insights into the
experiences and perceptions of context-aware app usage and
recommendation in-the-wild. To the best of our knowledge this
work is the first of its kind in the mobile RS space because of
its hybrid approach, its scope and its scale.

3. FRAPPE 


Input Signal        Values
------------------  ----------------------------------------------------
Installed apps      All non-system installed applications on the
                    Android phone.
Used apps           Number of times the application was used (in a
                    specific context)
Skipped apps        Apps that were recommended but not viewed/installed
                    by the user
Viewed apps         Apps that were recommended and installed by the user
Time of the day     One of 4 possible values: Morning (6am to 12pm),
                    Afternoon (12pm to 6pm), Evening (6pm to 12am),
                    Night (12am to 6am)
Day of the week     One of 7 possible values: Mon, Tues, Weds, Thurs,
                    Fri, Sat or Sun
Period of the week  One of 2 possible values: Weekend or Working day
                    (i.e. weekday)
Location            One of 3 possible values: Home, Work, Other
City                Boolean: True, if user is close (20km) to the center
                    of a major city; False otherwise
Country             Name of the country where the user is currently
                    located
Weather             One of 9 possible values: Sunny, Cloudy, Foggy,
                    Windy, Drizzle, Rainy, Stormy, Sleet, Snowy
Battery Level       One of 5 possible values: Full, High, Medium, Low,
                    Empty
Energy Source       One of 3 possible values: Battery, USB, AC
Connectivity        One of 3 possible values: WiFi, Mobile, No
Screen State        Boolean: True, on; False, off

Table 1: Input signals used in Frappe engine.


Frappe is a context-aware personalized recommender of mobile
apps. It runs on Android phones and recommends the most relevant
apps for the user based on his/her current situation (context)
and usage patterns. Frappe automatically adapts to the user's
needs by utilizing information regarding installed and used apps
in a variety of contexts and settings. The full list of currently
supported contextual input signals used by Frappe is shown in
Table 1. Frappe's recommendations are provided by a novel CARS
algorithm that uses a Tensor Factorization approach. The model
was designed to work with implicit feedback data for mobile app
recommenders. In this case, the implicit data contains
information about how often a user used an app, for how long and
in which contexts. However, we do not have explicit user
feedback, e.g. their rating for a given app. Therefore, we
consider app usage as an indirect indication of user interest in
the app. Frappe's algorithm was shown to achieve up to a 28%
improvement in performance (measured in Mean Average Precision)
over another state-of-the-art method presented in [13] in a
series of off-line experiments using a dataset generated by the
Appazaar mobile app recommender.

Architecture and Logging: Frappe uses a standard client-server
architecture. The server side computes the recommendation models
and provides the top-21 recommendations in response to a client
request via the HTTPS protocol. The client side consists of an
Android mobile app that (1) displays the recommendations and (2)
runs a background service, which collects app usage data and
other contextual information as shown in Table 1. This background
service samples the current state of the phone once per minute
and gathers (1) the last known readings of various sensors and
(2) information on which applications are currently in use (in
the foreground) by the user. This data is sent in batches to the
Frappe server where it is cleaned and enriched with additional
information such as weather and location information abstracted
at a higher level (i.e., home/work, country, city, etc.). This
architecture provides detailed logs on app usage with minimal
impact on battery consumption (< 3% of daily battery
consumption).

Footnote 4: See http://appazaar.net/

The Frappe Interface: Screenshots of Frappe can be seen in
Figure 1. The main screen of Frappe (Figure 1(a)) shows the top 3
most relevant recommendations for the current user in their
current context. Users can swipe to the right to get more
recommendations (see Figure 1(b)). If the user clicks on the tile
of an application, the app details screen appears (Figure 1(c)).
If the user decides to install an app, (s)he is redirected to the
Google Play website for that app. If the user swipes to the far
left, (s)he is presented with personalized categories (see Figure
1(e)), i.e. the list of categories that are most relevant to that
particular user in their current context. Frappe also logs all
user interactions and clicks inside the app. Therefore, we can
trace back and observe which apps the user viewed or installed
and in what situations.

Explanations of Recommendations: In addition to recommendations,
Frappe also provides an explanation of the contextual factors
that led to the recommendation of each app (see Figure 1(d) for
an example). The main purpose of these explanations is to
communicate the role of context in making the recommendations,
i.e., to share with the user the fact that Frappe takes their
context into account and to explain which specific contextual
factors were the most important when making that particular
recommendation.

Frappe uses a chi-square based heuristic to generate
explanations. The explanation engine exploits app usage
statistics for all users to generate plausible reasons for the
recommendation. Here we assume independence of the contextual
factors. If an app is used significantly more often in a specific
context than the average app, this contextual factor is used to
explain the recommendation. We first compute Pearson's statistic:

$$\chi^2 = \frac{(O_{ic} - E_{ic})^2}{E_{ic}}$$

where $O_{ic}$ and $E_{ic}$ are the observed and the expected
number of times the app $i$ was used in context $c$. To compute
$E_{ic}$ we first compute the fraction of times all the apps were
used in context $c$ and multiply this fraction by the total
number of times the app $i$ was used. We use the number of
distinct values in the contextual dimension $C \ni c$ as the
degrees of freedom and compute the p value of the chi-square
statistic. We use up to 3 real-time contextual factors of the
user with the highest p value and only if $p < 0.1$ and
$O_{ic} - E_{ic} > 0$.
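
The heuristic above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Frappe implementation: the function names (`chi2_sf`, `explain`), the dictionary layout of the counts, and the example numbers are all hypothetical. The chi-square tail probability is computed with a closed form for integer degrees of freedom so that the sketch needs only the standard library. How qualifying factors are ordered before keeping at most three is this sketch's own choice (most significant first).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term = total = 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series over half-integer gamma values.
    total = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return total


def explain(app_counts, all_counts, current_context, max_factors=3, alpha=0.1):
    """Return the contextual dimensions that qualify as explanations.

    app_counts[dim][value]: times app i was used in that context value.
    all_counts[dim][value]: times any app was used in that context value.
    current_context[dim]:   the user's current value for that dimension.
    """
    scored = []
    for dim, value in current_context.items():
        app_total = sum(app_counts[dim].values())
        all_total = sum(all_counts[dim].values())
        observed = app_counts[dim].get(value, 0)
        # E_ic = (share of all usage falling in context c) * (total usage of app i)
        expected = all_counts[dim][value] / all_total * app_total
        if expected == 0:
            continue
        chi2 = (observed - expected) ** 2 / expected
        df = len(all_counts[dim])  # number of distinct values, as in the text
        p = chi2_sf(chi2, df)
        if p < alpha and observed - expected > 0:
            scored.append((p, dim))
    scored.sort()  # assumption of this sketch: most significant factor first
    return [dim for _, dim in scored[:max_factors]]


# Hypothetical counts: the app is used mostly at home, evenly across the day.
app_counts = {
    "location": {"Home": 80, "Work": 10, "Other": 10},
    "time_of_day": {"Morning": 25, "Afternoon": 25, "Evening": 25, "Night": 25},
}
all_counts = {
    "location": {"Home": 300, "Work": 400, "Other": 300},
    "time_of_day": {"Morning": 250, "Afternoon": 250, "Evening": 250, "Night": 250},
}
current = {"location": "Home", "time_of_day": "Morning"}
print(explain(app_counts, all_counts, current))
```

With these made-up counts only the location dimension qualifies: the app is used at home far more often than expected, whereas its morning usage matches the expectation exactly.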


4. STUDYING FRAPPE IN-THE-WILD 

In recent years researchers have begun to explore the use of app
markets like Google Play as a means of recruiting participants
and running large-scale mobile user studies in-the-wild [19, 11].
While this approach offers a number of benefits, in particular
related to the amount of quantitative usage data collected, such
studies tend to lack qualitative insights. In consequence,
researchers in the mobile HCI community have proposed hybrid
approaches where mobile applications are evaluated both globally
via large-scale app market deployments and concurrently in
smaller-scale, local studies. Such approaches lead to richer
insights. We opted to evaluate Frappe using a similar approach
which involved: (a) releasing the Frappe app globally via Google
Play to attract a large user base, enabling us to gather usage
statistics; and (b) concurrently running a smaller-scale study in
the UK with 33 participants, enabling us to gather insights about
participants' subjective perceptions and experiences with the
recommendations provided. In the following sections we describe
each deployment in more detail.



Figure 2: Distribution of downloads around the world.

4.1 Large-Scale Deployment via Google Play 

Frappe was deployed in the Android Market on Feb 15th, 2013. We
actively advertised the app among our friends, colleagues and in
online social forums such as Reddit. In this paper we report the
usage results derived from the data collected during the roughly
2-month period from Feb 15th to Apr 22nd, 2013. During this
timeframe the app was installed on 1000 mobile devices, equating
to 340 different Android user agents. The distribution of user
locations is shown in Figure 2. We had users from 37 countries:
41% of the user base came from the USA, 13% from the UK, 10% from
Spain and 9% from Australia. Note that the majority of users were
from English-speaking countries. We also observed that the
primary language of the recorded apps was English (91% of apps),
followed by Chinese (1.1%), Spanish (1.1%) and German (0.8%). In
total we collected approx. 351K data tuples in the form of
user × app × context × app usage count, where the possible values
of the contexts are listed in Table 1.
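
To make the tuple format concrete, the following sketch aggregates per-minute foreground samples into user × app × context × count tuples. It is illustrative only: the sample layout, the function names and the restriction to two contextual dimensions (time of day and location) are assumptions of this sketch, and the time-of-day boundaries (Morning 6:00-12:00, Afternoon 12:00-18:00, Evening 18:00-24:00, Night 0:00-6:00) are our reading of the time-of-day signal.

```python
from collections import Counter


def time_of_day(hour):
    """Map an hour (0-23) to one of the four time-of-day values."""
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 24:
        return "Evening"
    return "Night"


def aggregate(samples):
    """Count how often each (user, app, context) combination was sampled.

    samples: iterable of (user, app, hour, location) tuples, one per
    once-per-minute reading in which the app was in the foreground.
    """
    counts = Counter()
    for user, app, hour, location in samples:
        context = (time_of_day(hour), location)
        counts[(user, app, context)] += 1
    return counts


# Hypothetical samples for one user.
samples = [
    ("u1", "Chrome", 9, "Home"),
    ("u1", "Chrome", 10, "Home"),
    ("u1", "Chrome", 20, "Work"),
]
for (user, app, context), n in sorted(aggregate(samples).items()):
    print(user, app, context, n)
```

A real pipeline would carry all the contextual dimensions of the input-signal table rather than the two used here.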

4.2 Small-Scale In-Situ Study 

In this section we describe a 3-week field study of Frappe 
with 33 Android users. 

4.2.1 Participants 

Using an external recruitment company based in the UK, we
recruited a diverse group of active Android users. We define an
active Android user as a user who has at least five mobile apps
installed on their phone and who uses at least one of these apps
once per week. Participants were required to own a smartphone
running Android OS version 2.2 or higher. We recruited 33
participants in total, 21 male and 12 female, with varied age
ranges: 4 participants were between 18-24, 11 participants were
between 25-30, 6 participants were between 31-34, while the
remaining 12 participants ranged in age between 35-44. 80% of
participants had 10 or more apps installed, while the remaining
20% of users had 5-9 apps installed.

4.2.2 Procedure 

We used a moderated online community forum to gather subjective
insights from our participants. This online qualitative research
approach offers a number of compelling advantages. First, online
communities run over a prolonged period of time, which allows
in-depth interaction and collaboration, thus enabling researchers
to review and probe topics in more detail and refine questions as
the study progresses. Given that the duration of participant
interaction with the researchers in an online community is
typically longer than in face-to-face methods, this approach can
lead to more qualitative insights and richer information. Second,
online communities are more cost effective than face-to-face
discussions.

Participants were asked to install the Frappe application on
their mobile phone and to use it for a period of 3 weeks. During
that time participants accessed a closed, online community forum
each evening to answer specific questions about their experiences
with Frappe. A new topic was posted to the forum by a moderator 6
out of 7 days per week (i.e. every day except Sundays) over the
study period. Each topic comprised a set of sub-questions or
tasks. Topics included:

1. The installation, usability, functionality, ease of use
and look and feel of the application;

2. The perceived value and/or drawbacks of the app both
initially and after prolonged use;

3. The quality, relevance and suitability of the
recommendations provided;

4. Their understanding of context, their experiences and
perceptions in receiving context-aware recommendations and
their understanding of the explanations that Frappe provided
to them;

5. Their use of Frappe in different places and situations
and the relevance of recommendations in various contextual
settings; and finally

6. Their attitudes towards control, preference settings,
privacy and security.

Participant responses were moderated and if more detail 
was required or responses were unclear, the moderator could 
probe participants in more detail. Participants were able to 
see the responses of fellow participants and were also free 
to engage in group-based discussions should they desire. At 
the end of each day, all forum responses were analyzed and 
underlying themes were identified. In this way findings from 
each day could feed into the questioning for the following 
day. This iterative approach was very beneficial and enabled 
us to gather a rich set of subjective insights (see Results 
section). 






Application      Category  #Installed    Application     Category  #Viewed    Application     Category   #Used
---------------  --------  ----------    --------------  --------  -------    --------------  ---------  -----
Time Lapse       Photo     25            Flipboard       News      103        Chrome          Browser    170K
Bubble Shooter   Game      11            Expedia Hotels  Travel    49         Gmail           Email      145K
Pocket           News      11            Time Lapse      Photo     49         WhatsApp        Messenger  136K
Bridge the Wall  Game      10            Bubble Shooter  Game      48         Facebook        Social     123K
Clean Master     Tool      9             Firefox         Browser   44         GO Launcher EX  Launcher   94K

Table 2: Most installed vs. viewed vs. used applications.


5. RESULTS 

In this section we report the analysis of both the large- 
scale deployment and the small-scale user study. In total, 
Frappe was installed by approx. 1000 users, i.e. on 1000 
different mobile devices. During deployment we recorded 
usage of approx. 24K apps which equates to approx. 15.8 
unique apps per user. This highlights the large variety of 
apps available and the uniqueness of the user profile but also 
gives a sense of the magnitude of the app discovery problem. 
We cleaned the data and removed users who had problems 
in retrieving recommendations, or did not send any data to 
the server due to unknown technical difficulties. In the rest 
of the paper we will report the results based on the usage 
data from 986 users. 

5.1 Characterizing context-dependent mobile 
app usage 

A number of researchers have explored context-dependent mobile
use in the past. For example, in 2006 Verkasalo conducted a study
which investigated differences in mobile service usage in general
across a variety of contexts like home, work, etc. More recently,
Böhmer et al. studied temporal patterns of mobile app usage using
a dataset collected from over 4000 Android users between Aug 2010
and Jan 2011. The key focus of that work was on durations,
categories and sequences of app usage. Given that Frappe tracks
all apps used by all Frappe users (i.e. outside the scope of
Frappe), the log data recorded during the large-scale deployment
provides us with a rich and varied dataset with which we can
investigate context-dependent mobile app usage behaviors based on
a 2-month snapshot of 2013. Because of the high speed at which
the mobile space evolves, it is likely that new app usage
patterns have emerged since prior related work. In this section
we report on this characterization.

In total we identified 2.3M app usage events. App usage is
assumed if the user's screen is on and there is an app running in
the foreground. Because of the context logging, we were also able
to investigate where people use their apps and in what
conditions. We found that 13.6% of app usage occurred at home, 3%
at work and 83% at other locations. We infer the home and work
locations using a time-based heuristic, i.e. the most repeated
location where the user is from 1am to 6am is considered to be
the home and the most repeated location between 9am and 6pm is
considered to be work. We tested this approach with 10 pilot
users and obtained an accuracy of 95%, with only 1 false
recognition of a work location (all home locations were correctly
recognized), giving us confidence in the accuracy of the
heuristic. Our statistical analysis suggests that the
location-based contexts that Frappe identified (home, work,
other) were of very
different nature and are not random {p < 10“®|^ 

Footnote 5: Note that the inference of locations only takes place
after Frappe has access to sufficient usage data, which was 3
days on average.
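
The time-based home/work heuristic can be sketched as follows. This is a minimal illustration, not the deployed code: it assumes that raw positions have already been discretized into hashable location identifiers (for example grid cells), and the function names and sample data are hypothetical.

```python
from collections import Counter


def infer_home_work(samples):
    """Infer home and work from (hour_of_day, location_id) samples.

    Home is the most repeated location between 1am and 6am; work is
    the most repeated location between 9am and 6pm.
    """
    samples = list(samples)
    night = Counter(loc for hour, loc in samples if 1 <= hour < 6)
    day = Counter(loc for hour, loc in samples if 9 <= hour < 18)
    home = night.most_common(1)[0][0] if night else None
    work = day.most_common(1)[0][0] if day else None
    return home, work


def label_location(location_id, home, work):
    """Map a location to one of the three context values."""
    if location_id == home:
        return "Home"  # home wins if both heuristics pick the same place
    if location_id == work:
        return "Work"
    return "Other"


# Hypothetical samples: location ids are arbitrary strings.
samples = [(2, "A"), (3, "A"), (4, "B"), (10, "C"), (11, "C"), (14, "A"), (20, "D")]
home, work = infer_home_work(samples)
print(home, work, label_location("D", home, work))
```

When a user spends both nights and working hours in the same place, the two heuristics coincide; the sketch then labels that place as Home, which is a design choice of the sketch rather than something the text specifies.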


The mosaic plot in Figure 3 helps to visualize the influence of
context on the usage of apps for 4 popular app categories
(Social, Productivity, News and Communication). We show location
context (home, work, other), the 4 app categories and temporal
context in the form of weekend vs workday. The width of each tile
is proportional to the marginal frequency of the location
dimension, given time. If there is no dependency on time, the
width of the corresponding tiles (e.g., work|weekend vs
work|workday) in these two contexts would be the same. The red
color indicates values significantly below the average, and blue
indicates values significantly above the average. Hence, all the
cells except for the bottom left cell (Social apps used at home
on a workday) denote highly statistically significant results.

Figure 3 provides rich insights regarding the contextual usage of
apps with respect to location, time and category, showing that
app usage is highly context dependent (at least for these 4
popular app categories). Communication apps are by far the most
frequently used app category both on weekends and weekdays and
particularly while at home. Apps across all 4 categories are used
less than average at work on the weekend, which seems intuitive.
Social and Communication apps are used significantly more than
average while at home on the weekend, whereas Productivity and
News apps are used significantly more during workdays both at
work and in other locations. Interestingly, on workdays at home,
Communication apps are used more frequently than Social,
Productivity or News apps.

While our results highlight contextual influence in terms of
categories of apps, we also analyzed the usage of three
individual popular mobile apps: Chrome (indicative of Web
browsing and online information access), Facebook (Social app)
and WhatsApp (Communication, one of the most popular mobile
instant messenger apps). Figure 4 shows that usage of Chrome is
the least dependent on contextual factors, whereas usage of
WhatsApp is highly contextual with high statistical significance.
The most interesting observation is the contrasting usage
patterns between WhatsApp and Facebook at home during a working
day. Facebook is used more than average on working days while at
home, whereas WhatsApp is used significantly less than average in
that context. Conversely, Facebook is used less than average on
weekends while at home and WhatsApp is used significantly more
than average in that context.

Figure 3: Mosaic plot of the contextual influence on the usage of
applications. Usage counts of 4 app categories are displayed
(Communication, News, Social, Productivity).

Figure 4: App usage in various locations and day of the week.

5.2 Insights from the Large-Scale Deployment

This section focuses on findings from the large-scale deployment
of Frappe. In total we logged 4437 sessions of Frappe usage. Each
user logged from 1 to 218 sessions (avg: 4.5, std: 13). In order
to help us assess the effectiveness of the recommendations
provided by Frappe, we adopt some of the approaches to
usage-centric evaluation of app recommendations proposed by
Böhmer et al. and compute app conversion rates at two stages of
app engagement. The first is viewed-to-installed (i.e. where the
user views the app detail page of the recommended app and opts to
install the app); the second is installed-to-direct usage. In
total we had 3726 view app detail events and 714 of these led to
an install of an app. This equates to a 19% conversion rate and
is similar to results previously reported for a context-aware
method. The view-to-installation conversion rate previously
reported for a popularity-based method is only 9%. Hence,
Frappe's context-aware recommendations convert better than a
standard app popularity method. To investigate further, we
explored the conversion rates by app category (for categories
that had at least 10 installs, to have more reliable statistics).
Games (31%) have the highest conversion rates, followed by News
(26%) and Music (23%). The lowest conversion rates were for
Communication (11%) and Social (16%). We believe that games are a
much easier category to recommend than social apps, as games are
more fad-like in nature and mobile users are likely to be more
comfortable with trying different new games rather than switching
to a new communications tool.

Next, we investigated short-term engagement, that is, whether
users used the app directly after the installation. We obtained a
57.8% conversion rate, compared to the 30%-50% reported in prior
work. We further analyzed the logs to get better insights into
the success of the recommendations. On average, users installed
0.16 applications per Frappe session, i.e., they installed 1
recommended app every 6.25 sessions of using Frappe.

In almost every session, users looked at the details of a
recommended app (on average 0.84 recommended app views per
session). Users installed from 0 to 17 recommended apps (avg:
0.7, std: 1.82). The top installed and viewed applications can be
seen in Table 2 together with their category. We also aggregated
the statistics of the most installed and viewed apps to determine
the most installed and viewed categories. The most installed
categories were Tools, Social and Communication and the most
viewed categories were Communication, Social and Travel. This is
a surprising result, given that Social and Communication apps had
the lowest conversion rates.

5.3 Small-Scale Study 

Due to technical difficulties associated with acquiring the
participants' device IDs, we could not retrieve the log data from
7 participants in the small-scale study. While we have
qualitative insights for 33 participants, the final number of
participants for whom we have mobile usage data and could do
quantitative analysis is 26. In total we collected 646 sessions
(min: 2, max: 85, avg: 25.5, std: 18.53 per user) from these 26
participants. They installed between 0 and 16 recommended apps
(avg: 2.56, std: 3.9). On average users installed 0.1
applications per session (compared to 0.16 in the large
deployment) and viewed the details of 0.68 recommended apps per
session. We looked into the conversion rates for this population.
The users had a 14.4% view-to-install conversion rate and a 34.7%
direct usage conversion rate. Both are lower than the
corresponding statistics of the large-scale deployment.

Over the 3-week study period, the forum resulted in more than
2,300 posts in the form of participant answers to questions. Note
that all examples provided in this paper are actual participant
responses from the forum. The vast majority of participants found
Frappe very easy to use: 32 participants rated the ease of use of
the app as 4 or 5 on a 5-point scale, with only one participant
providing a neutral rating of 3. The simple, clean layout and
intuitive navigation made Frappe appear accessible and
user-friendly. Participants were very positive about the concept
behind Frappe and believed that it could genuinely save them
time, e.g.: "I really like it. Its got so many different features
to keep me entertained its rather fun looking through the
different recommendations.", "I like that it explains why its
recommending each app (time of day, similar apps), I like that
the apps change" and "I think its great, saves time plus
recommends other apps you may like. I like it, good idea".
Participants overwhelmingly reported that Frappe recommended apps
that they had not heard of, or were unlikely to find otherwise,
e.g., "I would say that around 80 per cent of the apps I have
been recommended I had never come across before" and "The vast
majority of these apps I would not have found or considered if
they had not been recommend".

However, despite installing some of the apps recommended
by Frappe, participants were in general quite
discerning and selective when installing apps. Reasons for
not installing recommended apps included: (1) the
recommendations did not take their fancy at the specific time,
(2) cynicism around new apps and a desire to review them
first, (3) not necessarily perceiving a real benefit and (4)
limited phone memory. For users who did install a few
recommended apps we found many positive comments related
to their relevance, e.g. "...I was just arriving on a train and
I was recommended a taxi app which was perfect. I would
usually use Google to search for a local taxi company when I'm
traveling for work.", and "As far as recommended apps at home
they did seem quite relevant. My instruments are at home and
so the metronome app was kindly welcomed."

One of the most important issues expressed by participants
was that they could not see how they could improve
or directly influence the recommendations provided, e.g. "I
don't see how I can improve my recommendations. The same
stuff is always recommended, but I have no way of telling it
that I don't want it". While Frappe provides end-user
recommendations in an automatic manner, many users expressed
a desire to control what was recommended, how apps were
recommended, and the situations and categories in
which those apps were recommended. For example, "I would
like the opportunity to tell the app about the things I like and
dislike, so that it can tailor my recommendations accordingly".

One of our concerns with deploying Frappe was privacy.
Given that context-aware systems use more information
than classic RS, a lot of this information could raise
significant privacy concerns for the user. As is shown in
Table [?], we use location, time and install/usage information for
apps. We found that many participants did not understand
the exact contextual data points used to generate their
recommendations. In addition we found that the consensus
reached by our participants was that if the recommendations
improve and provide value, they do not have any problems
sharing their location information. However, there needs to
be a clear benefit for them to share such information.

6. LESSONS LEARNED 


In this section we outline the key learning outcomes from
our field studies of Frappe. Some of these findings corroborate
previous work, while others complement it. We believe
that these lessons would be beneficial to the MobileHCI
community, particularly to researchers and practitioners
developing mobile systems that include recommendations.

6.1 Explaining the large with the small 

As suggested by Morrison et al. for hybrid approaches
to mass participation trials, we can use the small-scale study
to explain the large-scale deployment. According to the
usage-centric evaluation framework proposed by [?], the Frappe
system performs well. We find higher app conversion rates
(19%) in terms of views to installs when compared to standard
popularity-based metrics (9%) and app-aware filtering
(7%). We also find higher conversion rates in terms of
installs to direct usage (57.8% vs. 30-50%) when compared to
the approaches evaluated in [?]. The key issue, however, is
that while Frappe performs well based on this usage-centric
evaluation approach, the small-scale study highlighted some
important drawbacks of the system that negatively affected
end-user experiences. We discuss each of these key issues
below. Our first key lesson, however, relates to the importance
of using a hybrid approach in order to get a more complete
picture of user experiences and perceptions of mobile app
recommendations.
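
The two conversion rates compared above can be computed
directly from interaction logs. The following Python sketch is
illustrative only and is not the evaluation code used in the
study. The event tuple format and the function name
conversion_rates are assumptions, and the sketch ignores
timestamps, so it does not check that usage happened shortly
after the install.

```python
from collections import defaultdict


def conversion_rates(events):
    """Return (view_to_install, install_to_usage) conversion rates.

    `events` is an iterable of (user_id, app_id, action) tuples, where
    action is "view", "install" or "use". This log format is hypothetical.
    """
    # Collect the distinct (user, app) pairs seen at each stage.
    pairs = defaultdict(set)
    for user_id, app_id, action in events:
        pairs[action].add((user_id, app_id))

    viewed = pairs["view"]
    # Count only installs of apps whose details the same user also viewed.
    installed = pairs["install"] & viewed
    # Count only usage of apps that reached the install stage.
    used = pairs["use"] & installed

    view_to_install = len(installed) / len(viewed) if viewed else 0.0
    install_to_usage = len(used) / len(installed) if installed else 0.0
    return view_to_install, install_to_usage


# Made-up example log: four viewed apps, two installs, one app used.
example_events = [
    ("u1", "taxi", "view"), ("u1", "taxi", "install"), ("u1", "taxi", "use"),
    ("u1", "metronome", "view"),
    ("u2", "taxi", "view"),
    ("u2", "compass", "view"), ("u2", "compass", "install"),
]
view_to_install, install_to_usage = conversion_rates(example_events)
```

Counting distinct (user, app) pairs rather than raw events keeps
repeated views of the same app from deflating the rate; whether
that matches the framework's exact definition is an assumption.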

6.1.1 Avoid Highly Irrelevant Recommendations 

Many (21 of 33) participants in the small-scale study
reported issues related to irrelevant recommendations at least
once during the study. We believe this is mainly due to the
lack of data to train our complex models (cold-start problem).
Most of these issues were related to receiving app
recommendations in a foreign language, e.g. "I have now
received 3 different apps in foreign languages. I believe them
to be all German but as I don't speak this language, I cannot
say 100% it is!". Other comments signaled that Frappe
lacked a basic understanding of the target audience for some
of the apps, e.g. "I was offered the 'My Pregnancy Today'
app, because I have a stopwatch app. This isn't really relevant
as I am a male". These highly irrelevant recommendations
have an extremely negative effect on end-user experiences.
Considering that existing evaluations of RS focus mainly on
precision/recall over relevant recommendations, we suggest
that future evaluations of RS, in particular
for mobile systems, take these highly irrelevant items
into account. Paul Lamere coined a test for such irrelevant
items called the 'WTF test', in which a WTF score is
computed by summing up highly irrelevant items in the top-N
recommendation list. Thus we propose to adopt RS
evaluation metrics that take into account the severity of the
mistakes (via WTF scores or other approaches), not just
the number of mistakes.
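
As an illustration, a WTF score of this kind can be sketched in
a few lines of Python. The first function counts highly
irrelevant items in a top-N list, as described above. The second
function, the penalty values and the item names are assumptions
that show one possible way to weight mistakes by severity; how
items get flagged (for example, through user reports) is left
open.

```python
def wtf_score(top_n, highly_irrelevant):
    """Count the highly irrelevant items in a top-N recommendation list."""
    return sum(1 for item in top_n if item in highly_irrelevant)


def severity_weighted_error(top_n, severity):
    """Average per-item penalty of a top-N list.

    `severity` maps an item to a penalty in [0, 1]; items that are not
    listed count as 0 (no mistake). This weighting scheme is an assumption.
    """
    if not top_n:
        return 0.0
    return sum(severity.get(item, 0.0) for item in top_n) / len(top_n)


# Hypothetical top-4 list with two flagged items.
recommended = ["taxi", "pregnancy_tracker", "german_news", "metronome"]
flagged = {"pregnancy_tracker", "german_news"}
penalties = {"pregnancy_tracker": 1.0, "german_news": 1.0, "metronome": 0.5}

score = wtf_score(recommended, flagged)
weighted = severity_weighted_error(recommended, penalties)
```

The weighted variant distinguishes a mildly off-target item
(penalty 0.5 here) from a recommendation the user finds absurd,
which a plain precision figure would treat identically.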

Footnote (WTF test): http://musicmachinery.com/2011/05/14/how-good-is-googles-instant-mix/

6.1.2 Learn Quickly

Also of critical note is the time frame a RS has to prove
itself as effective. It is well known that new users first
assess whether they can trust a system and tend to be quite
critical when doing so [12, 23]. In our small-scale study the
participants came to a consensus that they would give an app
recommender between 1 week and 1 month to prove itself.
However, when we analyzed the usage log data from the
large-scale deployment we noticed a very different pattern.
Specifically, unsatisfied users did not wait for a month, but
uninstalled apps within an hour of their installation. While
we don't know entirely why app uninstalls took place, it
is reasonable to assume that uninstall actions are a signal
of disinterest or dislike of the recommended app. Thus it
seems that mobile app RS have very little time to prove
their worth, which means that it is imperative to get the
recommendations right immediately. Moreover, we would
argue that the short-term uninstall conversion rate should also
be a key part of any usage-centric app recommendation
evaluation measures [?].


6.1.3 Automatic Discovery vs. Explicit Control

Our aim with Frappe was to build a fully automated app
discovery system, i.e., where recommendations are provided
by observing implicit user actions without the need for
explicit input or feedback from the user. Participants of the
small-scale study felt that the dialogue with Frappe was
largely one-way, with minimal capacity for them to control
their experience. When participants were asked what
features they would like to include in Frappe, the majority
of these were related to controlling the recommendations
and their preferences, e.g. 28 of the 33 participants
indicated that they would like a blacklist facility, i.e. the ability
to tell Frappe when they do not like a recommended app
and ensure that the app is never recommended again. 19
of the 33 participants indicated a need for a favorites list
where they can add apps they are interested in for later
installation. In addition participants wanted to personalize
the service by telling Frappe more details about them, e.g.,
"It would have been handy to tell the app your workdays for
example". Thus, participants wanted a two-way dialogue
involving varying levels of feedback from them coupled with
the ability to set preferences and control their experiences
more explicitly. Providing these capabilities and learning
from these preferences would lead to enriched end-user
experiences. Critiquing-based RS [18] have been developed to
address precisely this challenge. In future work, we would
like to incorporate elements of mixed-initiative systems into
Frappe.
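
To make the requested controls concrete, the following Python
sketch shows one possible shape for a blacklist and a favorites
list. It is not part of Frappe; the class UserPreferences, its
methods and the function apply_preferences are hypothetical
names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UserPreferences:
    """Explicit feedback collected from one user (hypothetical structure)."""
    blacklist: set = field(default_factory=set)
    favorites: list = field(default_factory=list)

    def block(self, app_id):
        """Mark an app so that it is never recommended to this user again."""
        self.blacklist.add(app_id)
        if app_id in self.favorites:
            self.favorites.remove(app_id)

    def save_for_later(self, app_id):
        """Remember an app the user wants to install later."""
        if app_id not in self.blacklist and app_id not in self.favorites:
            self.favorites.append(app_id)


def apply_preferences(ranked_apps, prefs):
    """Drop blacklisted apps from a ranked list, keeping the original order."""
    return [app for app in ranked_apps if app not in prefs.blacklist]


prefs = UserPreferences()
prefs.block("pregnancy_tracker")
prefs.save_for_later("metronome")
visible = apply_preferences(["taxi", "pregnancy_tracker", "metronome"], prefs)
```

Filtering after ranking leaves the underlying model untouched;
feeding the same feedback back into training, as the text
suggests, would be a separate step that this sketch does not
cover.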


6.1.4 Use & Perceptions of Context 

While the majority of users expected their recommendations
to adapt to their current situation, at times Frappe did
not meet their expectations in this regard. Users reported:

1. Perceiving no adaptation based on their current situation,
e.g. "I have used the app either at home or work
and have not noticed any difference to be honest."

2. Encountering irrelevant recommendations given their
current location, e.g. "When out and about I didn't notice
that I was being recommended any travel apps or
things related to location at all".

3. Encountering irrelevant recommendations based on temporal
context, e.g. "I get apps that are recommended
because it is the weekend, I assume that the app thinks
everybody is off work at the weekend, for me this isn't
the case, my days off change every week.".


Likewise, Frappe attempted to convey its context-awareness
in the form of context-sensitive explanations (see Figure 1(d)).
It used a broad definition of why the recommendation was
provided, by saying "Recommended because your current
situation is: Afternoon, Barcelona, Spain". These explanations
successfully communicated the use of the context to
the end-user. However, some users (18%) perceived these
explanations as too generic, e.g. "Android Music player... It said
that the day of the week is Wednesday, the country is UK, and
it is a workday... on those reasons it could have recommended
anything!". While contextual data can help the algorithm to
build a better predictive model, such data did not always
help in explaining the recommendations.

Figure 5: Time since uninstall of the Frappe app. (a) Days
since installation. (b) Hours since installation.

These findings reveal that users understand the concept
of context-dependent adaptations and already have expectations
about how a system should adapt based on their context.
Moreover, the patterns of app usage in various contexts
show that context-awareness is essential when modeling mobile
users and their behaviors. However, user expectations
that a contextual service will adapt to their current context
are very high and are likely to be different for each user;
they can therefore be difficult to fulfill.


6.2 Evaluation of CARS in the Mobile Space 

We have provided comparisons to the usage-centric app
recommendation evaluation framework proposed by [?] and
have obtained good performance results. However, based
on rich user insights collected from the small-scale study
and our experiences via the large-scale deployment, we have
a suggestion to enrich the framework proposed in [?], which
we believe will help researchers working in the mobile CARS
space to better evaluate their mobile app recommenders.

We believe that the direct usage conversion measure might
not be the best measure to assess the quality of app
recommendations. Usage that occurs straight after the installation
of an app is likely caused by curiosity rather than by
the quality of the app recommendations. Mobile users must
launch/open an app once installed to see what the app holds
within. We would argue that a more important measure of
quality (failure) is the direct uninstall of the application that
was just installed. A variation of this measure is used by
Appaware (personal communication with the Appaware team)
and Google Play as one of the most important signals for
the quality of an application.

Footnote (Google Play): https://developers.google.com/events/io/sessions/326335584

To showcase this measure, we computed the uninstall rates
for the Frappe app. Figure 5(a) shows that the majority of
uninstalls of Frappe occurred during the first day. Actually,
most users uninstalled Frappe during the first hour (see
Figure 5(b)), others within the first minutes. This highlights
the fact that users evaluate the perceived quality and benefit
of the RS (or any other app) within a very short period
of time. As such, if you provide the wrong recommendations
within the first minutes, it is likely that the user's loyalty
to the RS will be threatened. By looking at not only these
uninstall rates, but also the time between install and uninstall,
users can better assess the quality and effectiveness of
their app recommendations.
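
The time between install and uninstall discussed above can be
summarized with a short script. The Python sketch below is an
assumption-laden illustration, not the analysis code behind
Figure 5: the per-user timestamp dictionaries, the function
names and the example dates are all made up.

```python
from datetime import datetime, timedelta


def uninstall_delays(installs, uninstalls):
    """Return the install-to-uninstall time for each user who uninstalled.

    Both arguments map a user id to a datetime (hypothetical log format).
    """
    return [
        uninstalls[user] - installs[user]
        for user in uninstalls
        if user in installs
    ]


def share_within(delays, window):
    """Fraction of uninstalls that happened within `window` of the install."""
    if not delays:
        return 0.0
    return sum(1 for delay in delays if delay <= window) / len(delays)


# Made-up example: five installs at the same moment, four later uninstalls.
t0 = datetime(2013, 1, 1, 12, 0)
installs = {user: t0 for user in ("u1", "u2", "u3", "u4", "u5")}
uninstalls = {
    "u1": t0 + timedelta(minutes=10),
    "u2": t0 + timedelta(minutes=30),
    "u3": t0 + timedelta(hours=5),
    "u4": t0 + timedelta(days=3),
}  # u5 kept the app

delays = uninstall_delays(installs, uninstalls)
first_hour = share_within(delays, timedelta(hours=1))
first_day = share_within(delays, timedelta(days=1))
```

Reporting the share for several windows (minutes, one hour, one
day) gives the same kind of view as the two panels of Figure 5
without requiring a plot.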

6.3 Predicting Usage vs. Recommending Apps 

In most RS domains, usage of an item is taken as a positive
signal that the user liked the item in question. This
signal can be noisy, but on average, consumption patterns
can be used to generate reliable recommendations. This
is a reasonable assumption in many domains where
recommendations are commonplace (e.g. movies, music, etc.).
However, it might not be appropriate in the app recommendation
domain, or at least not for all categories of apps.

In fact, directly employing app usage data can lead to
poor recommendations because usage patterns and app
installation patterns can be very different. Apps are typically
installed to satisfy certain needs, which are highly dependent
on the context. An app (e.g. Compass) might be used
rarely, but might be of great importance to the user. Table [?]
shows the top installed, viewed and used apps. Most of the
top apps are browsers, launchers and communication tools.
Given that most users already have their favorite of these
tools installed on their mobile phones, recommending similar
apps, even if these apps are not known to the user, might
not be the most appropriate action to take. Even though
recent work has shown that users install and use more than
one app within the same category [?], in our experience,
the participants of the small-scale study did not appreciate
being recommended apps within the same category of apps
that they frequently used.

One might argue that installation data rather than usage
data could be used to train the RS models. However,
after our 2-month deployment, we only recorded 714 install
actions, compared to 2.3M app usage actions. Thus it is
unlikely that a service like Frappe will ever record a sufficient
number of install actions to use such data for training the
RS models. As such we believe there is a great opportunity
for future research on novel implicit algorithms to provide
enhanced mobile app recommendations.

7. CONCLUSIONS 

We have presented insights and lessons learned from a
hybrid study of a context-aware app recommender called
Frappe. Our results indicate that contextual variables, such
as location and time, are very important signals for modeling
app usage and providing recommendations. However,
feedback collected in the small-scale user study shows that
while users understand the value of context-dependent
adaptation, their expectations in this regard are also very high.
We provide a set of lessons learned which outline important
considerations in designing, deploying and evaluating mobile
context-aware RS in-the-wild with real users.